- Parametrized Models
- Symbols – similar to Factor Graphs
- Bubbles
- Black = observed variables
- Blue = computed variable
- Round blue shape
- Arrow direction = the direction in which the function is easy to compute
- Deterministic functions
- Red square
- Cost function
- single scalar output
- Loss Function
- Minimization by gradient based methods
- Can easily find the gradient of a function
- function is differentiable
- almost everywhere
- should be continuous
- can have kinks
- Gradient Descent
- There are algorithms that aren't gradient-based
- e.g. staircase-type cost functions, where the gradient is zero almost everywhere
- or we don't know the function / can't get a gradient
- zeroth-order methods / gradient-free methods
- whole family of these methods
- used in reinforcement learning
- where the cost isn't differentiable
- (cost becomes a black box)
- can apply gradient estimation (perturb the parameters and measure the change in cost)
- very inefficient in high dimensions, where the space to search is huge (see the sketch after this list)
- Can use a critic method (Actor-Critic, A2C, etc.)
- by training a differentiable critic module "C" to estimate the cost function
- Reward is the negative of a cost
- For minibatches, a rough rule of thumb: batch size ≈ number of categories (or 2x)
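- A minimal sketch (not from the original notes) contrasting a gradient-descent update with a finite-difference (zeroth-order) gradient estimate; the quadratic cost and the numbers of dimensions/probes are made up for illustration:

```python
import torch

# Toy differentiable cost: a quadratic bowl (hypothetical stand-in for a real loss).
def cost(w):
    return (w ** 2).sum()

# Gradient descent: one autograd-based update.
w = torch.randn(1000, requires_grad=True)
loss = cost(w)
loss.backward()
with torch.no_grad():
    w -= 0.1 * w.grad        # step in the negative gradient direction
    w.grad.zero_()

# Zeroth-order estimate: probe the black-box cost with random perturbations.
# Each probe gives one noisy directional derivative, so the number of probes
# needed grows with the dimension of w -- this is why it is inefficient.
w0 = torch.randn(1000)
eps, n_probes = 1e-3, 100
grad_est = torch.zeros_like(w0)
for _ in range(n_probes):
    u = torch.randn_like(w0)
    grad_est += (cost(w0 + eps * u) - cost(w0 - eps * u)) / (2 * eps) * u
grad_est /= n_probes
```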
- Neural Nets
- Backprop
- Pytorch
- from torch import nn
- make a class for the net (subclass nn.Module)
- Linear layers (nn.Linear) – see the sketch below
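- A minimal PyTorch sketch of the points above; the class name `Net` and the layer sizes are illustrative, not from the notes:

```python
import torch
from torch import nn

# Minimal network definition: subclass nn.Module and stack linear layers
# with a non-linearity in between (sizes here are made up).
class Net(nn.Module):
    def __init__(self, in_dim=784, hidden=100, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)          # logits

model = Net()
y = model(torch.randn(32, 784))     # batch of 32 inputs -> (32, 10) logits
```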
- Chain rule for vector functions
- Jacobian Matrix
- The computation graph can be transformed into a second graph that computes the gradients, i.e. backpropagates them (sketch below)
- Can be very complex if the architecture is data-dependent
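- A small sketch of the chain rule for vector functions: backprop's result should equal the Jacobian transpose times the upstream gradient; the toy function `g` is made up for illustration:

```python
import torch
from torch.autograd.functional import jacobian

# Chain rule for vector functions: for y = g(x), c = f(y),
# dc/dx = J_g(x)^T . dc/dy, where J_g is the Jacobian of g.
def g(x):                       # toy vector-to-vector function
    return torch.stack([x[0] * x[1], x[0] + x[1] ** 2])

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = g(x)
c = y.sum()                     # toy scalar cost
c.backward()

J = jacobian(g, x.detach())     # 2x2 Jacobian of g at x
dc_dy = torch.ones(2)           # gradient of c w.r.t. y
print(x.grad)                   # backprop result: tensor([4., 8.])
print(J.T @ dc_dy)              # same values via the explicit Jacobian-transpose product
```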
- Modules used in neural nets
- used because library implementations are optimized (a from-scratch sketch of a few of these follows this list)
- Linear: Y = W.X
- ReLU: y = ReLU(x)
- Duplicate: y1 = x ; y2 = x
- Used when wire splits into two
- Add: y = x1 + x2
- Max: y = max(x1, x2)
- LogSoftMax: y_i = x_i - log(sum_j e^(x_j))
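- A from-scratch sketch of the forward/backward rules for a few of these modules, assuming the convention that `dy` is the gradient of the cost with respect to the module's output:

```python
import torch

def linear_forward(W, x):            # y = W x
    return W @ x

def linear_backward(W, x, dy):       # gradients w.r.t. W and x
    return torch.outer(dy, x), W.T @ dy

def relu_forward(x):                 # y = max(x, 0), elementwise
    return x.clamp(min=0)

def relu_backward(x, dy):            # gradient passes only where x > 0
    return dy * (x > 0).float()

def duplicate_backward(dy1, dy2):    # y1 = x, y2 = x  =>  gradients add up
    return dy1 + dy2

def add_backward(dy):                # y = x1 + x2  =>  same gradient to both inputs
    return dy, dy

def logsoftmax_forward(x):           # y_i = x_i - log(sum_j exp(x_j))
    return x - torch.logsumexp(x, dim=0)
```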
- Softmax
- Sigmoid with targets at its asymptotes doesn't work very well for classification
- the gradient of the sigmoid at its extremes is very small because the curve is flat there
- this leads to the saturation problem
- Solutions
- Set targets in between instead of 1/0 (e.g. 0.8 and 0.2)
- Or take the log of it
- Taking the log of the sigmoid (a small numerical check follows below)
- log sigma(s) = s - log(1 + e^s)
- for large s, the log term ≈ s and the expression goes to 0
- for very negative s, the expression behaves like s, so the gradient stays close to 1
- doesn't saturate! – no vanishing gradients
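- A quick numerical check of the saturation argument, assuming the standard logistic sigmoid:

```python
import torch

# At large |s| the sigmoid is flat, so its gradient is ~0 (saturation).
# The log of the sigmoid, log sigma(s) = s - log(1 + e^s), is asymptotically
# linear on the negative side, so its gradient does not vanish there.
s = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)

sig = torch.sigmoid(s)
sig.sum().backward()
print(s.grad)            # ~[4.5e-05, 0.25, 4.5e-05] -> saturates at both ends
s.grad.zero_()

logsig = torch.nn.functional.logsigmoid(s)
logsig.sum().backward()
print(s.grad)            # ~[1.0, 0.5, 4.5e-05] -> stays ~1 where s is very negative
```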
- Tricks (several of these are wired together in a sketch after this list)
- Use ReLU non-linearities – works well for many layers (scaling invariant)
- Cross entropy loss – log softmax is a simpler special case
- Stochastic gradient on minibatches
- Shuffle the training samples
- otherwise the network just adapts to the most recent type of input (e.g. if all samples of one class arrive in a row)
- Normalize inputs to 0 mean and unit variance
- can use it on RGB images as well – normalize each channel separately
- since the channels can have very different means
- Schedule a decrease of the learning rate
- Dropout regularization
- L2 -> weight decay at every update
- L(w) = C(w) + alpha * R(w); R(w) = ||w||^2
- leads to shrinking the weights at every iteration
- L1 -> R(w) = sum_i |w_i|
- "lasso"
- least absolute shrinkage and selection operator
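- A sketch wiring several of these tricks together in one PyTorch training loop; the data, network sizes, and hyperparameters are hypothetical:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data, normalized to zero mean / unit variance per feature.
X = torch.randn(1000, 20)
X = (X - X.mean(0)) / X.std(0)
y = torch.randint(0, 5, (1000,))
# 5 classes -> batch size 10 (the 2x-categories rule of thumb above), shuffled minibatches.
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),       # ReLU non-linearities
    nn.Dropout(p=0.5),                  # dropout regularization
    nn.Linear(64, 5),                   # logits for 5 classes
)
criterion = nn.CrossEntropyLoss()       # cross-entropy on logits (log-softmax inside)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 / weight decay
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # LR decrease

for epoch in range(30):
    for xb, yb in loader:               # stochastic gradient on minibatches
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()
```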
- Additional references
- "Efficient BackProp" (LeCun et al.)
- "Neural Networks: Tricks of the Trade"
- Any directed acyclic graph is ok for backprop
- Lab
- Neural networks alternate rotations (linear transformations) and squashing (non-linearities)
- Draw inputs at the bottom
- Having a high-dimensional intermediate representation is very helpful, or simply use more hidden layers
- because the number of connections grows significantly
- Logit output of final layer
- loss is cross entropy / negative log likelihood
- Choice of activation function is very important
- train a bunch of networks with different initial values – use the variance of their predictions to gauge uncertainty (sketch below)
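- A sketch of that last point: train several copies of the same network from different random initial values and read the spread of their predictions as a rough uncertainty estimate; the architecture and data are made up:

```python
import torch
from torch import nn

def make_net():
    # Same architecture each time; only the random initial values differ.
    return nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

# Hypothetical regression data with a little noise.
X = torch.randn(200, 2)
y = (2 * X[:, :1] - X[:, 1:]) + 0.1 * torch.randn(200, 1)

ensemble = []
for _ in range(5):                       # a bunch of networks, each with its own initialization
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        loss = ((net(X) - y) ** 2).mean()
        loss.backward()
        opt.step()
    ensemble.append(net)

# Spread of the predictions across the ensemble ~ uncertainty of the model.
x_test = torch.randn(10, 2)
with torch.no_grad():
    preds = torch.stack([net(x_test) for net in ensemble])   # (5, 10, 1)
print(preds.mean(0).squeeze())   # ensemble prediction
print(preds.std(0).squeeze())    # higher std = less certain
```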